Abstract:

This paper examines the use of binary trees in the design of
efficient parallel algorithms. Using binary trees, we develop
efficient algorithms for several scheduling problems. The shared
memory model for parallel computation is used. Our success in using
binary trees for parallel computations indicates that the binary
tree is an important and useful design tool for parallel algorithms.

1. Introduction

Algorithm design techniques for single processor computers have been
extensively studied. For example, Horowitz and Sahni [15] extol the
virtues of such design methods as divide-and-conquer, dynamic
programming, the greedy method, backtracking, and branch-and-bound.
These methods generally lead to efficient sequential (i.e., single
processor) algorithms for a variety of problems. These algorithms,
however, are not very efficient for computers with a very large
number of processors. In this paper, we propose a design method that
we have found useful in the design of algorithms for computers that
have many processors. The method proposed here is called the binary
tree method. While this method has been used in the design of
parallel algorithms earlier, here we attempt to show its broad
applicability to the design of such algorithms. It is hoped that
further research will bring to light some other basic design tools
for parallel algorithms. One should note that trees have been used
extensively in the design of efficient sequential algorithms. In
fact, divide-and-conquer, backtracking, and branch-and-bound all use
an underlying computation tree [15]. The use of binary trees as
proposed here is quite different from the use of trees in sequential
computation.

With the continuing dramatic decline in the cost of hardware, it is
becoming feasible to economically build computers with thousands of
processors. In fact, Batcher [5] describes a computer (MPP) with
16,384 processors that is currently being built for NASA. In coming
years, one can expect to see computers with a hundred thousand or
even a million processing elements. This expectation has motivated
the study of parallel algorithms. Since the complexity of a parallel
algorithm depends very much on the architecture of the parallel
computer on which it is run, it is necessary to keep the
architecture in mind when designing the algorithm. Several parallel
architectures have been proposed and studied. In this paper we shall
deal directly only with the single instruction stream, multiple data
stream (SIMD) model. Our technique and algorithms readily adapt to
the other models (e.g., multiple instruction stream, multiple data
stream (MIMD) and data flow models). SIMD computers have the
following characteristics:

(1)  They consist of p processing elements (PEs). The PEs are
     indexed 0, 1, ..., p-1 and an individual PE may be referenced
     as in PE(i). Each PE is capable of performing the standard
     arithmetic and logical operations. In addition, each PE knows
     its index.

(2)  Each PE has some local memory.

(3)  The PEs are synchronized and operate under the control of a
     single instruction stream.

(4)  An enable/disable mask can be used to select a subset of the
     PEs that are to perform an instruction. Only the enabled PEs
     will perform the instruction. The remaining PEs will be idle.
     All enabled PEs execute the same instruction. The set of
     enabled PEs can change from instruction to instruction.


While several SIMD models have been proposed and studied, in this
paper we wish to make a distinction between the shared memory model
(SMM) and the remaining models, all of which employ an
interconnection network and use no shared memory. In the shared
memory model, there is a common memory available to each PE. Data
may be transmitted from PE(i) to PE(j) by simply having PE(i) write
the data into the common memory and then letting PE(j) read it.
Thus, in this model it takes only O(1) time for one PE to
communicate with another PE. Two PEs are not permitted to write into
the same word of common memory simultaneously. The PEs may or may
not be allowed to simultaneously read the same word of common
memory. If simultaneous reads are allowed, then we shall say that
read conflicts are permitted.

Most algorithmic studies of parallel computation have been based on
the SMM ([1], [7], [8], [11], [12], [13], [24], [25], [30]). This
model is, however, not very realistic as it assumes that the p PEs
can access any p words of memory (1 word per PE) in the same time
slot. In practice, however, the memory will be divided into blocks
so that no two PEs can simultaneously access words in the same
block. If two or more PEs wish to access words in the same memory
block then the requests will get queued. Each PE will be served in a
different time slot. Thus, in the worst case O(p) time could be
spent transferring data to the p PEs. All the papers cited earlier
ignore this and take the time for a simultaneous memory access by
all PEs to be O(1).

SIMD computers with restricted interconnection networks appear to be
more realistic. In fact, the ILLIAC IV is an example of such a
machine. There are several other such machines that are currently
being fabricated. The largest of these is the massively parallel
processor (MPP) designed by K. Batcher. It has p = 16K. A block
diagram of a SIMD computer with an interconnection network is given
in Figure 1.1. Observe that there is no shared memory in this model.
Hence, PEs can communicate amongst themselves only via the
interconnection network.

While several interconnection networks have been proposed (see
[33]), we shall describe only three interconnection networks here.
These are: mesh, cube, and perfect shuffle. The corresponding
computer models are described below. Figure 1.2 shows the resulting
interconnection patterns.

Figure 1.1 Block diagram of an SIMD computer


i) Mesh Connected Computer (MCC)

In this model the PEs may be thought of as logically arranged as in
a k dimensional array $A(n_{k-1}, n_{k-2}, \ldots, n_0)$, where $n_i$
is the size of the ith dimension and
$p = n_{k-1} n_{k-2} \cdots n_0$. The PE at location
$A(i_{k-1}, \ldots, i_0)$ is connected to the PEs at locations
$A(i_{k-1}, \ldots, i_j \pm 1, \ldots, i_0)$, $0 \le j < k$, provided
they exist. Data may be transmitted from one PE to another only via
this interconnection pattern. The interconnection scheme for a 16 PE
MCC with k = 2 is given in Figure 1.2(a).


ii) Cube Connected Computer (CCC)

Assume that $p = 2^q$ and let $i_{q-1} \ldots i_0$ be the binary
representation of i for $i \in [0, p-1]$. Let $i^{(b)}$ be the number
whose binary representation is
$i_{q-1} \ldots i_{b+1} \bar{i}_b i_{b-1} \ldots i_0$, where
$\bar{i}_b$ is the complement of $i_b$ and $0 \le b < q$. In the cube
model, PE(i) is connected to PE($i^{(b)}$), $0 \le b < q$. As in the
mesh model, data can be transmitted from one PE to another only via
the interconnection pattern. Figure 1.2(b) shows an 8 PE CCC
configuration.


iii) Perfect Shuffle Computer (PSC)

Let p, q, i and $i^{(b)}$ be as in the cube model. Let
$i_{q-1} \ldots i_0$ be the binary representation of i. Define
SHUFFLE(i) and UNSHUFFLE(i) to, respectively, be the integers with
binary representation $i_{q-2} i_{q-3} \ldots i_0 i_{q-1}$ and
$i_0 i_{q-1} \ldots i_1$. In the perfect shuffle model, PE(i) is
connected to PE($i^{(0)}$), PE(SHUFFLE(i)), and PE(UNSHUFFLE(i)).
These three connections are, respectively, called exchange, shuffle,
and unshuffle. Once again, data transmission from one PE to another
is possible only via the connection scheme. An 8 PE PSC
configuration is shown in Figure 1.2(c).

Figure 1.2 Interconnection patterns: (a) 4 x 4 MCC; (b) 8 PE CCC;
(c) 8 PE PSC. Boxes represent PEs.
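
The three connection rules can be written compactly as operations on
PE indices. The following Python sketch is only an illustration: the
function names are hypothetical (they do not appear in the paper),
mesh positions are assumed to be tuples of coordinates, and cube and
shuffle indices are assumed to be q-bit integers with p = 2^q.

```python
def mesh_neighbors(idx, dims):
    """Neighbors of the PE at position idx in a mesh with sizes dims.

    idx and dims are tuples of equal length k; a neighbor differs by
    +1 or -1 in exactly one coordinate and must lie inside the array.
    """
    out = []
    for j in range(len(dims)):
        for step in (-1, 1):
            v = idx[j] + step
            if 0 <= v < dims[j]:
                out.append(idx[:j] + (v,) + idx[j + 1:])
    return out


def cube_neighbors(i, q):
    """PE(i) is connected to PE(i^(b)): i with bit b complemented."""
    return [i ^ (1 << b) for b in range(q)]


def shuffle(i, q):
    """Rotate the q-bit representation of i left by one position."""
    return ((i << 1) | (i >> (q - 1))) & ((1 << q) - 1)


def unshuffle(i, q):
    """Rotate the q-bit representation of i right by one position."""
    return (i >> 1) | ((i & 1) << (q - 1))


def psc_neighbors(i, q):
    """The three connections of PE(i) in a perfect shuffle computer."""
    return {"exchange": i ^ 1,
            "shuffle": shuffle(i, q),
            "unshuffle": unshuffle(i, q)}
```

For an 8 PE machine (q = 3), `cube_neighbors(5, 3)` lists the three
PEs whose index differs from 101 in one bit, and
`unshuffle(shuffle(i, 3), 3)` returns i for every i.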


It should be noted that the MCC model requires 2k connections per
PE, the CCC model requires log p connections per PE (all logarithms
in this paper are base 2), and the PSC model requires only three
connections per PE. The SMM requires a large (and impractical)
number of PE to memory connections to permit simultaneous memory
access by several PEs. It should also be emphasized that in any time
instance, only one unit of data (say one word) can be transmitted
along an interconnection line. All lines can be busy in the same
time instance though.

Each of the four models (including the SMM) described above has
received much attention in the literature. Agerwala and Lint [1],
Arjomandi [2], Csanky [8], Eckstein [11] and Hirschberg [12] have
developed algorithms for certain matrix and graph problems using the
SMM. Hirschberg [13], Muller and Preparata [24] and Preparata [30]
have considered the sorting problem for the SMM. The evaluation of
polynomials on the SMM has been studied by Munro and Paterson [25],
while arithmetic expression evaluation has been considered by Brent
[7] and others. Efficient algorithms to sort and perform data
permutations on an MCC can be found in Thompson and Kung [38],
Nassimi and Sahni [26] and [27], and Thompson [37]. Thompson's
algorithms [37] can also be used to perform permutations on a CCC
and a PSC. Lang [19], Lang and Stone [20], and Stone [36] show how
certain permutations may be performed using shuffles and exchanges.
Nassimi and Sahni [28] develop fast sorting and permutation
algorithms for a CCC and a PSC. Dekel, Nassimi, and Sahni [9]
present efficient matrix multiplication and graph algorithms for
CCCs and PSCs.

The algorithms considered in this paper are described explicitly
only for the SMM. The algorithms are readily translated into
algorithms for the other SIMD models. In some cases, it may be
necessary to use the data broadcasting algorithms developed by
Nassimi and Sahni [29] to accomplish this adaptation to the other
models.

Throughout this paper, we assume that no read conflicts are allowed.
To see the importance of this assumption, consider the partition
problem. In this problem we are given n numbers
$a_1, a_2, \ldots, a_n$ and we wish to determine if there is a subset
S of $\{1, 2, \ldots, n\}$ such that
$\sum_{i \in S} a_i = \sum_{i \notin S} a_i$.

This can be done in O(log n) time if read conflicts are allowed. The
first phase of this algorithm uses $\lceil n/\log n \rceil 2^n$ PEs.
PEs are divided into $2^n$ groups of $\lceil n/\log n \rceil$ PEs
each. The PE groups are indexed 0, 1, ..., $2^n - 1$. Each PE group,
i, considers the subset
$S_i = \{a_j \mid \text{bit } j \text{ of } i \text{ is } 1\}$. The
elements in $S_i$ are added in O(log n) time using the
$\lceil n/\log n \rceil$ PEs in the PE group (this is described later
in this section). Next, the elements not in $S_i$ are added. If
$\sum_{a_j \in S_i} a_j = \sum_{a_j \notin S_i} a_j$ then one of the
PEs in group i sets V(i) to 1; otherwise V(i) is set to 0. In the
second phase, Valiant's [39] O(log log m) algorithm is used to
determine the maximum V(i). Since there are $2^n$ V(i)'s, this takes
O(log n) time. The answer to the partition problem is "yes" iff the
maximum V(i) is 1. The total time taken by the above algorithm is
O(log n).


The procedure described above has read conflicts in two of its
steps. First, when the PE groups are computing sums, many PEs will
attempt to simultaneously read the same $a_j$. To remove these
conflicts, we will need to make $2^n$ copies of each $a_j$, one copy
for each PE group. This takes O(n) time using no read conflicts.
Second, Valiant's algorithm also has read conflicts. Removing these
also takes O(n) time. So, the complexity of our parallel partition
problem algorithm is O(log n) if read conflicts are permitted, and is
O(n) if they are not.
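
The two phases of the partition algorithm can be mimicked
sequentially to make the role of the V(i) values concrete. The
sketch below is an assumption-laden illustration, not the parallel
algorithm itself: the outer loop stands in for the 2^n PE groups, and
Python's built-in `max` stands in for the second-phase maximum
computation. The name `partition_exists` is hypothetical.

```python
def partition_exists(a):
    """Return True iff the numbers in a can be split into two parts
    with equal sums, by examining all 2**n subsets S_i."""
    n = len(a)
    total = sum(a)
    v = []
    for i in range(2 ** n):              # one iteration per PE group i
        # S_i contains a[j] exactly when bit j of i is 1
        s_i = sum(a[j] for j in range(n) if (i >> j) & 1)
        # sum over S_i equals sum over its complement iff 2*s_i == total
        v.append(1 if 2 * s_i == total else 0)
    return max(v) == 1                   # second phase: maximum V(i)
```

This simulation takes time exponential in n; the O(log n) bound in
the text relies on having all 2^n groups work simultaneously.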


We first illustrate the binary tree method on a very simple problem.
Let us consider how we might compute the sum $\sum_{i=1}^{n} A(i)$,
$n \ge 1$. The most frequently used sequential algorithm for this
computation uses the parsing
$\sum_{i=1}^{n} A(i) = (\ldots((A(1) + A(2)) + A(3)) + \ldots + A(n))$.
To arrive at an efficient parallel algorithm, it is necessary to
consider the parsing
$\sum_{i=1}^{n} A(i) = (\ldots(((A(1) + A(2)) + (A(3) + A(4))) + ((A(5) + A(6)) + (A(7) + A(8)))) + \ldots)$.
Computation corresponding to this parsing scheme is best described by
a complete binary tree with n leaves. Figure 1.3 describes the
computation for the case n = 11.

Figure 1.3 Computation tree for $\sum_{i=1}^{11} A(i)$

The square nodes represent nodes at which addition is to be
performed. The circular nodes represent initial data. Nodes have
been numbered using the standard numbering scheme for complete
binary trees. Node indices appear outside the nodes. Let V(i) be the
corresponding A() value for node i if i denotes a circular node.
V(i) is initially undefined for the other nodes. Thus for Figure
1.3, V(17) = A(2); V(13) = A(9); V(21) = A(6); etc. Using the tree
of Figure 1.3, $\sum_{i=1}^{11} A(i)$ may be computed in 4 steps
using 4 PEs as follows:

step 1:  Use three PEs to compute, in parallel, V(8) = V(16) +
         V(17); V(9) = V(18) + V(19); and V(10) = V(20) + V(21)

step 2:  Use four PEs to compute, in parallel, V(i) = V(2i) +
         V(2i+1), 4 <= i <= 7

step 3:  Use two PEs to compute, in parallel, V(i) = V(2i) +
         V(2i+1), 2 <= i <= 3

step 4:  Use one PE to compute V(1) = V(2) + V(3) =
         $\sum_{i=1}^{11} A(i)$.

From the nature of a binary computation tree, it is clear that
parallel addition needs at most $\lfloor n/2 \rfloor$ PEs. The
parallel addition algorithm is described more formally in Figure
1.4. In lines 2 and 4, the use of a <= b <= c means that this line
is to be executed in parallel for all b satisfying the inequality.
Line 2 can be performed in two steps using $\lfloor n/2 \rfloor$
PEs. Line 4 needs at most $\lfloor n/2 \rfloor$ PEs. It is clear
that the complexity of procedure SUM1 is O(log n).

line  procedure SUM1(A, n)
      //compute the sum of A(i), 1 <= i <= n, using floor(n/2) PEs//
      //initialize V//
  1   k <- floor(log2 n); j <- 2^k; t <- 2*(n-j); p <- n-1
  2   V(p+i) <- A((i+t-1) mod n + 1), 1 <= i <= n
  3   for i <- k down to 0 do //add by levels//
  4     V(j) <- V(2j) + V(2j+1), 2^i <= j <= min{p, 2^(i+1) - 1}
  5   end for
  6   return(V(1))
  7   end SUM1

Figure 1.4
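
A sequential simulation helps in checking the index arithmetic of
procedure SUM1. The Python sketch below follows the lines of Figure
1.4 under two assumptions of ours: the input is an ordinary 0-indexed
Python list (so A(x) becomes `a[x - 1]`), and the parallel execution
of line 4 is replaced by a loop over j, which is safe because the
nodes on one level read only values from the level below.

```python
def sum1(a):
    """Simulate procedure SUM1 on a non-empty list a of numbers."""
    n = len(a)
    k = n.bit_length() - 1          # floor(log2 n)
    j = 1 << k
    t = 2 * (n - j)
    p = n - 1
    v = [None] * (2 * n)            # nodes are numbered 1 .. 2n-1
    # line 2: place the inputs at the leaves p+1 .. p+n
    for i in range(1, n + 1):
        v[p + i] = a[(i + t - 1) % n]
    # lines 3-5: add level by level, from the deepest level to the root
    for i in range(k, -1, -1):
        lo, hi = 1 << i, min(p, (1 << (i + 1)) - 1)
        for node in range(lo, hi + 1):      # done in parallel on an SMM
            v[node] = v[2 * node] + v[2 * node + 1]
    return v[1]
```

For n = 11 the loop visits nodes 8 to 10, then 4 to 7, then 2 to 3,
and finally node 1, matching the four steps listed for Figure 1.3.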


In addition to analyzing the complexity of a parallel algorithm, one
often (see Savage [32]) also computes the effectiveness of processor
utilization (EPU). This is defined relative to a specific problem P,
the complexity of the fastest sequential algorithm known for P, and
the parallel algorithm A for problem P:

$$\mathrm{EPU}(P, A) = \frac{\text{complexity of the fastest sequential algorithm for } P}{\text{number of PEs used by } A \times \text{complexity of } A}$$

For the case of procedure SUM1,

$$\mathrm{EPU} = O\left(\frac{n}{n \log n}\right) = O\left(\frac{1}{\log n}\right)$$

Note that $0 \le \mathrm{EPU} \le 1$ and that an EPU close to 1 is
considered 'good'. For the case of computing $\sum_{i=1}^{n} A(i)$,
we can actually arrive at an O(log n) algorithm with an EPU of
$\Theta(1)$ (i.e., using only $\lceil n/\log n \rceil$ PEs) [32].
This is done by dividing the n A(i)s into $\lceil n/\log n \rceil$
groups, each group containing at most $\lceil \log n \rceil$ of the
A(i)s. Each of these groups is assigned to a PE which sequentially
computes the sum of the numbers in the group. This takes O(log n)
time. Now, we need to sum up these $\lceil n/\log n \rceil$ group
sums. Procedure SUM1 can be used to compute this sum in O(log n)
time.
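
The grouping idea can be sketched as follows. This is a rough
sequential model under our own assumptions: the function names are
hypothetical, the sequential phase is charged one time unit per group
element, and the tree phase is modeled by repeated pairwise addition
rather than by calling SUM1.

```python
from math import ceil, log2


def grouped_sum(a):
    """Sum a non-empty list using groups of at most ceil(log2 n) items.

    Returns (total, number_of_groups, modeled_parallel_steps).
    """
    n = len(a)
    size = max(1, ceil(log2(n)))
    groups = [a[s:s + size] for s in range(0, n, size)]
    sums = [sum(g) for g in groups]      # one PE per group, sequential
    steps = size
    while len(sums) > 1:                 # binary tree over the group sums
        sums = [sum(sums[s:s + 2]) for s in range(0, len(sums), 2)]
        steps += 1
    return sums[0], len(groups), steps


def epu(sequential_time, num_pes, parallel_time):
    """Effectiveness of processor utilization as defined in the text."""
    return sequential_time / (num_pes * parallel_time)
```

With n = 16 the sketch uses 4 groups and 6 modeled steps, so the
product of PEs and time stays proportional to n, which is what keeps
the EPU bounded below by a constant.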


Note that the discussion carried out so far concerning the
computation of $\sum_{i=1}^{n} A(i)$ applies just as well to the
computation of $\bigoplus_{i=1}^{n} A(i)$ where $\oplus$ is any
associative operator (for example, max, min, *, etc). Hence,
max{A(i)}; min{A(i)}; $\prod_{i=1}^{n} A(i)$; etc can all be computed
in O(log n) time using $\lceil n/\log n \rceil$ PEs.

Suppose that instead of computing just $\sum_{i=1}^{n} A(i)$, we
wish to compute $S_j = \sum_{i=1}^{j} A(i)$, $1 \le j \le n$. We
shall refer to this problem as the partial sums problem. When
computing $S_n$ using the sequential algorithm, we obtain $S_j$,
$1 \le j < n$, as a by-product and so, in this case, no additional
effort need be expended. In the case of procedure SUM1 (and its
refinement to the case of $\lceil n/\log n \rceil$ PEs), however,
not all the $S_j$s are computed during the computation of $S_n$.
Following the computation of $S_n$, the remaining $S_j$s can be
obtained by making one pass down the binary tree. In this pass each
node transmits to its children the sum of the values to the left of
the child.
Let A(1:11) = (1, 1, 2, 3, 1, 2, 1, 2, 3, 4, 2). The computation
tree of Figure 1.3 is redrawn in Figure 1.5. The index of each node
appears outside it. Inside each node there are two numbers. The
upper number is V as defined for procedure SUM1. The lower number in
each node is L, where for any node i, L is defined as:

$$L(i) = \begin{cases} 0 & i = 1 \\ L(i/2) & i \text{ is even} \\ L(\lfloor i/2 \rfloor) + V(i-1) & i \text{ is odd},\ i > 1 \end{cases}$$

Figure 1.5


One may easily verify that if i is a circular node representing
A(j), then $L(i) = \sum_{p=1}^{j-1} A(p)$. Hence, from the L values
of the circular nodes, one can easily obtain all the partial sums.
Our first algorithm for the partial sums problem is PSUM1 (Figure
1.6). This algorithm simply computes the V(i)s in the first pass and
the L(i)s in the second. Finally, the S values are computed.

As in the case of SUM1, the parallelism of lines 4 and 8 requires
only n/2 PEs. Using n/2 PEs, line 2 can be done in two steps.
Actually, procedure PSUM1 can be run in O(log n) time using only
$\lceil n/\log n \rceil$ PEs. The idea here is the same as that for
SUM1.

line  procedure PSUM1(A, n, S)
      //compute S(i) = sum of A(j), 1 <= j <= i, for 1 <= i <= n//
  1   k <- floor(log2 n); j <- 2^k; t <- 2*(n-j); p <- n-1
  2   V(p+i) <- A((i+t-1) mod n + 1), 1 <= i <= n
  3   for i <- k down to 0 do //add by levels//
  4     V(j) <- V(2j) + V(2j+1), 2^i <= j <= min{p, 2^(i+1) - 1}
  5   end for
      //compute Ls//
  6   L(1) <- 0
  7   for i <- 1 to k+1 do //compute L by levels//
  8     L(j) <- if j even then L(j/2)
                else L(floor(j/2)) + V(j-1)
                endif, 2^i <= j <= min{n+p, 2^(i+1) - 1}
  9   end for
 10   S((i+t-1) mod n + 1) <- L(p+i) + V(p+i), 1 <= i <= n
 11   end PSUM1

Figure 1.6

The perfect shuffle connection scheme seems to be well suited to the
binary tree method as it contains an underlying complete binary
tree. If we let the PEs represent nodes in a complete binary tree,
then the left child of PE i is PE 2i, and the right child is PE
2i+1. Since 2i = SHUFFLE(i) and 2i+1 = EXCHANGE(SHUFFLE(i)), the
downward pass is easily carried out. Also, PARENT(i) = UNSHUFFLE(i),
i even, and PARENT(i) = UNSHUFFLE(EXCHANGE(i)), i odd. So the
complexity analyses for SUM1 and PSUM1 hold even when a PSC is used.
For a binary tree with n leaves, a PSC with n-1 PEs is needed,
however.
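
The two-pass procedure PSUM1 can be simulated sequentially to check
the definitions of V and L. The sketch below is ours, not the
paper's: it assumes a 0-indexed Python list as input, replaces each
parallel line by a loop, and takes the upper limit of the L pass to
be the last node 2n-1 = n+p.

```python
def psum1(a):
    """Simulate the two-pass partial sums procedure on a non-empty
    list a; returns the list of prefix sums of a."""
    n = len(a)
    k = n.bit_length() - 1              # floor(log2 n)
    t = 2 * (n - (1 << k))
    p = n - 1
    v = [0] * (2 * n)                   # nodes 1 .. 2n-1
    for i in range(1, n + 1):           # leaves hold the inputs
        v[p + i] = a[(i + t - 1) % n]
    for i in range(k, -1, -1):          # upward pass: subtree sums V
        for j in range(1 << i, min(p, (1 << (i + 1)) - 1) + 1):
            v[j] = v[2 * j] + v[2 * j + 1]
    left = [0] * (2 * n)                # L values; left[1] = 0 at root
    for i in range(1, k + 2):           # downward pass, level by level
        for j in range(1 << i, min(n + p, (1 << (i + 1)) - 1) + 1):
            if j % 2 == 0:
                left[j] = left[j // 2]              # left child
            else:
                left[j] = left[j // 2] + v[j - 1]   # add left sibling
    s = [0] * n
    for i in range(1, n + 1):           # S = L + V at every leaf
        s[(i + t - 1) % n] = left[p + i] + v[p + i]
    return s
```

Running it on the example A(1:11) = (1, 1, 2, 3, 1, 2, 1, 2, 3, 4, 2)
yields the prefix sums 1, 2, 4, 7, 8, 10, 11, 13, 16, 20, 22.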

By using a slightly different computation tree and rearranging the
order of computation, one can arrive at a one pass algorithm for the
partial sums problem. Let A(0:n-1) be the n numbers to be added. Let
S(0:n-1) denote the partial sum array. A $2^k$-block of array
elements consists of all array elements whose indices differ only in
the least significant k bits. The $2^1$-blocks of A(0:10) are [0,1],
[2,3], [4,5], [6,7], [8,9], and [10]; the $2^2$-blocks are
[0,1,2,3], [4,5,6,7], and [8,9,10]; etc. Two $2^k$-blocks are
sibling blocks iff their union is a $2^{k+1}$-block. Thus, [0,1] and
[2,3] are sibling blocks; so also are [0,1,2,3] and [4,5,6,7].
However, [2,3] and [4,5] are not sibling blocks. The one pass
algorithm computes S by first computing the partial sums for all
$2^0$-blocks of A. In this case, S(i) = A(i). Next, S is computed
for all $2^1$-blocks; then for all $2^2$-blocks; and finally for the
single $2^q$-block where $q = \lceil \log_2 n \rceil$.

Let X and Y be two sibling $2^k$-blocks. Let X be the block
containing all elements with bit k equal to 0. The union of X and Y
is a $2^{k+1}$-block. Relative to this $2^{k+1}$-block, the S values
for elements of X are the same as with respect to the corresponding
$2^k$-block. The S values for elements in Y however change by the
sum of the A elements corresponding to the $2^k$-block X. Figure 1.7
gives the S values and $2^k$-blocks when S values are computed by
blocks as described above. Blocks are enclosed in brackets.

Figure 1.7 Computing S by blocks


The updating of S when going from one block size to the next is
easily performed if we keep track of the sum of the A(i)s in each
$2^k$-block. For this purpose, we use an auxiliary array T. T(i) for
i in a given $2^k$-block (except possibly the rightmost $2^k$-block)
is the sum of all the A(i)s in that block. Before we can formally
specify the partial sums algorithm, we need a processor assignment
scheme. Figure 1.7 shows a processor assignment scheme for our
example. Processors are assigned only to compute the S values that
change. Thus, when k = 0, PE(0) computes S(1); PE(1) computes S(3);
PE(2) computes S(5); PE(3) computes S(7); and PE(4) computes S(9).
When k = 3, PE(0) computes S(8); PE(1) computes S(9); and PE(2)
computes S(10). PEs 3 and 4 are idle when k = 3. Let
$\ldots i_2 i_1 i_0$ be the binary representation of i. The PE
assignment rule is obtained by defining the function f(i,j) to be
the integer with binary representation
$\ldots i_{j+1} i_j 0 i_{j-1} \ldots i_0$, i.e., i with a 0 inserted
at bit position j. For any k, PE(i) computes $S(f(i,k) + 2^k)$
provided that this index of S is no more than n-1. The one pass
partial sums algorithm is stated as procedure PSUM2 (Figure 1.8).
PSUM2 uses $\lfloor n/2 \rfloor$ PEs indexed 0 through
$\lfloor n/2 \rfloor - 1$.


It should be easy to see that our earlier ideas regarding the use of
only $\lceil n/\log n \rceil$ PEs carry over to the case of PSUM2.
So, PSUM2 can be modified to obtain an O(log n) one pass algorithm
using only $\lceil n/\log n \rceil$ PEs. For the modified algorithm,
EPU = $\Theta(1)$.

line  procedure PSUM2(A, S, n)
      //one pass partial sums//
  1   declare A(0:n-1), S(0:n-1), T(0:n-1)
  2   for each PE(i) do in parallel
        //initialize S and T for 2^0-blocks//
  3     j <- f(i,0)
  4     S(j) <- T(j) <- A(j)
  5     S(j+1) <- T(j+1) <- A(j+1)
  6     for k <- 0 to ceil(log2 n) - 1 do
          //combine 2^k-blocks//
  7       j <- f(i,k)
  8       if j + 2^k < n then
  9         S(j+2^k) <- S(j+2^k) + T(j)
 10         T(j+2^k) <- T(j+2^k) + T(j)
 11         T(j) <- T(j+2^k)
 12       endif
 13     end for
 14   end for
 15   end PSUM2

Figure 1.8 One pass partial sums algorithm
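
The block-combining pass can also be simulated sequentially. The
sketch below makes a few assumptions of its own: `insert_zero` is a
hypothetical name for the function f(i,j), every element of S and T
is initialized up front (so an odd n leaves no element unset), and
the loop over PE indices replaces parallel execution, which is safe
here because within one value of k each PE touches only its own pair
of positions j and j + 2^k.

```python
def insert_zero(i, j):
    """f(i, j): the integer obtained by inserting a 0 bit at position
    j of the binary representation of i."""
    low = i & ((1 << j) - 1)
    return ((i >> j) << (j + 1)) | low


def psum2(a):
    """Simulate the one pass partial sums procedure on a non-empty
    list a; returns the list of prefix sums of a."""
    n = len(a)
    s = list(a)                     # S for 2^0-blocks
    t = list(a)                     # T holds block sums
    q = (n - 1).bit_length()        # ceil(log2 n) for n >= 1
    for k in range(q):              # combine sibling 2^k-blocks
        for i in range(n // 2):     # PEs 0 .. floor(n/2) - 1
            j = insert_zero(i, k)   # element of the left block X
            partner = j + (1 << k)  # matching element of Y
            if partner < n:
                s[partner] += t[j]
                t[partner] += t[j]
                t[j] = t[partner]
    return s
```

With k = 3 and n = 11, `insert_zero(i, 3) + 8` gives 8, 9, 10 for
i = 0, 1, 2, in agreement with the processor assignment described in
the text.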
2. Parallel Scheduling Algorithms

In this section, we develop fast parallel algorithms for a variety
of scheduling problems. Each of these algorithms is arrived at using
the binary tree method of section 1. We shall refrain from providing
explicit formal statements of these algorithms such as those of
Figures 1.4, 1.6, and 1.8. Instead, we shall describe the algorithms
informally and illustrate them with an example. One should note that
we are interested in both the complexity and the EPU of the
algorithms developed.

All the scheduling problems to be discussed assume that n jobs have
to be scheduled on m identical machines. Associated with job i is a
four-tuple $(r_i, d_i, p_i, w_i)$ where $r_i$ is its release time;
$d_i$ is its due time; $p_i$ is its processing requirement; and
$w_i$ is its weight, $1 \le i \le n$. The processing of no job can
commence until its release time. No job can be scheduled for
processing on more than one machine at any time instant. Job i is
completed after it has been processed for $p_i$ time units. If a job
does not complete by its due time, it is tardy. In a nonpreemptive
schedule, job i is scheduled to process on a single machine from
some start time $s_i$ to the completion time $s_i + p_i$,
$1 \le i \le n$. In a preemptive schedule it is permissible to split
the processing of jobs over machines as well as over non-adjacent
time intervals.


2.1 Minimizing Maximum Lateness

Let S be a schedule for the n jobs $(r_i, d_i, p_i, w_i)$. Let $c_i$
be the completion time of job i. The lateness of job i is defined to
be $c_i - d_i$. The maximum lateness, $L_{max}$, is
$\max_i \{c_i - d_i\}$. We wish to obtain an m machine nonpreemptive
schedule that minimizes $L_{max}$. This problem is known to be
NP-hard [22]. So, we shall consider only special cases of this
problem, i.e., cases for which a polynomial time sequential
algorithm is known. Specifically, we shall consider the following
cases: (i) $p_i = 1$, $1 \le i \le n$, and all release times are
integer; (ii) m = 1 (i.e., the number of machines is 1) and
preemption is allowed; and (iii) cases (i) and (ii) with precedence
constraints. These three cases are considered in sections 2.1.1,
2.1.2, and 2.1.3 respectively. Since the weights play no part in the
$L_{max}$ problem, we shall only consider triples $(r_i, d_i, p_i)$
in these subsections.


2.1.1 $p_i = 1$, $1 \le i \le n$, and all release times are integer.

Jackson [16] has shown that when m = 1 and all jobs have the same
release time, $L_{max}$ is minimized by scheduling the jobs in
nondecreasing order of due times. Horn [14] and Baker and Sue [3]
have generalized this method to the case when m = 1 and all jobs do
not have the same release time. An optimal one machine schedule is
now obtained by assigning jobs to time slots, one slot at a time
starting at time 0. When we are considering the time slot [i, i+1],
we select a job with least due time from among the set of available
jobs. (The set of available jobs consists of all jobs not yet
selected that have a release time less than or equal to i.) If this
set is empty, then this slot is left idle. This strategy can be
implemented to run in O(n log n) time on a single processor
computer. Blazewicz [6] has extended this idea to the general case,
m > 1. His algorithm also schedules by time slots. Let J be the set
of jobs available when slot [i, i+1] is to be scheduled. If
$|J| \le m$ then all the available jobs are processed in [i, i+1].
If $|J| > m$, then we select m jobs with least due times.
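
The slot-by-slot rule can be sketched sequentially with a priority
queue keyed on due time. This is an illustrative sketch under our own
assumptions, not the authors' implementation: the function name and
the representation of jobs as (release, due) pairs are hypothetical,
idle slots are skipped rather than enumerated, and ties in due time
are broken by job index.

```python
import heapq


def schedule_unit_jobs(jobs, m):
    """Schedule unit-time jobs with integer release times on m machines.

    jobs is a list of (release, due) pairs. Returns (start, lmax):
    start[i] is the slot in which job i runs and lmax is the maximum
    lateness, or ([], None) if there are no jobs.
    """
    if not jobs:
        return [], None
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    start = [None] * len(jobs)
    heap = []                       # available jobs as (due, index)
    nxt, time, done = 0, 0, 0
    while done < len(jobs):
        if not heap:                # nothing available: jump ahead
            time = max(time, jobs[order[nxt]][0])
        while nxt < len(order) and jobs[order[nxt]][0] <= time:
            i = order[nxt]
            heapq.heappush(heap, (jobs[i][1], i))
            nxt += 1
        for _ in range(min(m, len(heap))):   # up to m least due times
            _, i = heapq.heappop(heap)
            start[i] = time
            done += 1
        time += 1
    lmax = max(start[i] + 1 - jobs[i][1] for i in range(len(jobs)))
    return start, lmax
```

Each job enters and leaves the heap once, so the sketch runs in
O(n log n) time, in line with the sequential bound quoted above.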

In developing the parallel algorithm, we first consider the case
m = 1. The algorithm of Horn is readily seen to be highly
sequential. No decision concerning time slot [i, i+1] can be made
unless we know the jobs that are available at this time. This of
course depends on which jobs were selected for the earlier time
slots. So, a straightforward adaptation of Horn's algorithm would
need n steps (one for each time slot). The overall complexity of the
resulting parallel algorithm would be $\Omega(n)$. This is not very
good. We are really interested in algorithms with complexity
$O(\log^k n)$ for some k.

Despite the highly sequential nature of Horn's method, his idea can
be used to arrive at a parallel algorithm with complexity
$O(\log^2 n)$. This is accomplished using the binary tree method. It
is helpful to consider an example. Suppose we have 14 jobs with
$r_i$ and $d_i$ as specified in Figure 2.1(a). The first step in our
proposed parallel algorithm is to sort the jobs by release times
(into nondecreasing order). Jobs with the same release time are
sorted into nondecreasing order of due time. Let
$R_1, R_2, \ldots, R_k$ be the k distinct release times of the n
jobs ($R_1 < R_2 < \cdots < R_k$). Let $R_{k+1} = \infty$. For our
example, the sorted sequence of jobs is shown in Figure 2.1(b);
k = 4; and $R_1 = 2$, $R_2 = 5$, $R_3 = 6$, $R_4 = 9$, and
$R_5 = \infty$.

Figure 2.1


Next, a binary computation tree is associated with the problem. The
tree used is the unique complete binary tree with k leaves. With
each node in this tree, we associate a time interval $(t_L, t_R)$.
Assume that the leaf nodes are numbered 1 through k, left to right.
The ith leaf node has associated with it the interval
$(R_i, R_{i+1})$. The interval $(t_L, t_R)$ associated with a
nonleaf node, N, is obtained from the intervals associated with the
two children of this node: $t_L(N) = t_L(\text{left child of } N)$
and $t_R(N) = t_R(\text{right child of } N)$. For our example, the
binary computation tree together with time intervals is shown in
Figure 2.2.


Figure 2.2 Computation tree for the example of Figure 2.1.

A schedule that minimizes L_max is now obtained by making
two passes over this computation tree. The first pass
is made level by level towards the root; the second is made
level by level from the root to the leaves. Let P be any
node in the computation tree. Let the interval associated
with P be (t_L, t_R). The set of available jobs, A(P), for P
consists exactly of those jobs that have a release time r_i
such that t_L ≤ r_i < t_R. This set of jobs may be partitioned
into two subsets, respectively called the used set and the
transferred set. The set of used jobs consists exactly of
those available jobs that will be scheduled between t_L and
t_R for the L_max problem defined by the job set A(P). The
remaining jobs in A(P) make up the transferred set. For our
example, the set of available jobs for the node representing
the interval (2, 6) is {5, 8, 6, 2, 3, 1, 4, 9}. If Horn's
algorithm is used on this set of jobs, then jobs 5, 8, 6,
and 2 will get scheduled in the interval from 2 to 6.
Hence, the used set is {5, 8, 6, 2} and the transferred set
is {3, 1, 4, 9}.

In the first of the two passes mentioned above, the
used and transferred sets for each of the nodes in the
computation tree are determined. For a leaf node the used and
transferred sets are determined by directly using Jackson's
rule. If P is a leaf node for the interval (t_L, t_R), then
the used set is obtained by selecting jobs from the
available job set A(P) for P in nondecreasing order of due times.
Since jobs with the same release time have already been
sorted by due times, the used set consists of the first
min{|A(P)|, t_R - t_L} jobs in A(P). The remaining jobs form
the transferred set. For our example, for the interval
(2,5), the set of used jobs is {5,8,6} while the set of
transferred jobs is {2,3}; for the interval (5,6), the used
set is {1} and the transferred set is {4,9}; etc. Figure
2.3 shows the used and transferred sets for each of the leaf
nodes for our example. The solid vertical line separates
the used jobs from the transferred jobs.
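
The leaf rule can be illustrated with a few lines of Python. This is only a sketch: the name `split_leaf` and the list representation are assumptions made for the illustration, and the job list is assumed to be already sorted by nondecreasing due time, as the sorting step guarantees.

```python
def split_leaf(available, t_left, t_right):
    """Split the available jobs of a leaf into (used, transferred).

    `available` lists job ids in nondecreasing order of due time;
    the leaf covers the interval (t_left, t_right) and every job
    needs one time unit, so at most t_right - t_left jobs fit.
    """
    k = min(len(available), t_right - t_left)
    return available[:k], available[k:]


# The two leaves discussed in the text.
print(split_leaf([5, 8, 6, 2, 3], 2, 5))
print(split_leaf([1, 4, 9], 5, 6))
```

The two calls reproduce the used and transferred sets quoted above for the intervals (2,5) and (5,6).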

For a nonleaf node, the used and transferred sets may
be computed from the used and transferred sets of its
children. Let P be a nonleaf node and let U_L, U_R, T_L, and T_R be
the used and transferred sets for its left and right
children respectively. Let (t_L, t_R), (t'_L, t'_R), and (t''_L, t''_R) be
the intervals, respectively, associated with node P, its
left child, and its right child. Clearly, t_L = t'_L; t'_R = t''_L;
and t_R = t''_R. It should be clear that if Horn's algorithm is
used to schedule the available jobs A(P) then the jobs in U_L
will be the ones scheduled from t'_L to t'_R. The set of jobs
scheduled from t''_L to t''_R will be some subset of T_L ∪ U_R.
Let Q denote the min{|T_L ∪ U_R|, t''_R - t''_L} jobs of T_L ∪ U_R
that have least due times. It is not too difficult to see that Q
is the subset of A(P) that is scheduled by Horn's algorithm
in the interval t''_L to t''_R. Hence, the used set for P is
U_L ∪ Q and the transferred set is T_R ∪ ((T_L ∪ U_R) - Q).
Observe that if U_L, U_R, T_L, and T_R are in nondecreasing
order of due times, then the set Q can be obtained by merging
together U_R and T_L and selecting the first
min{|T_L ∪ U_R|, t''_R - t''_L} jobs from the merged list. Q can
next be merged with U_L to obtain the used set in
nondecreasing order of due times. Another merge
yields the transferred set in nondecreasing order of due
times. Figure 2.3 gives the used and transferred sets in
nondecreasing order of due times for all nodes in our
example computation tree.

Figure 2.3 First pass of the L_max algorithm

In the second pass, the used sets are updated so that
the used set for a node representing the interval (t_L, t_R) is
precisely the subset of jobs (from amongst all n jobs) that
is scheduled in this interval by Horn's algorithm when
solving the L_max problem for the entire job set. This is done by
working down the computation tree level by level starting
with the root. The used set for the root node is unchanged
in this pass. If P is a node whose used set has been updated
then the used sets for the left child and the right child of
P are obtained in the following way. Let the interval
associated with P be (t_L, t_R) and let the interval associated
with its left child be (t_L, t'_R). Let V be the subset of the
used set of P consisting solely of jobs with a release time
less than t_L. Let U be the current used set (i.e. the one
computed in the first pass) for the left child of P. Let W
be the set obtained by merging U and V (note that U and V
are disjoint and that both are ordered by due times). The
new used set, U', for the left child of P consists of the
first min{|W|, t'_R - t_L} jobs in W. The used set for the right
child of P consists of all jobs in the used set for P that
are not included in U'.

Let us now go through this second pass on our example.
Let P be the root node. (t_L, t_R) = (2, ∞) and V is empty. Hence,
the new used set for the left child of P is simply its old
used set. The used set for the right child of P becomes {3,
7, 1, 4, 12, 9, 13, 11, 14, 10}. Now, let P be the right
child of the root. (t_L, t_R) = (6, ∞); V = {3, 1, 4, 9}; W = {3,
7, 1, 4, 9, 10}. The new used set for the left child of P
is {3, 7, 1}. The new used set for the right child of P is
{4, 12, 9, 13, 11, 14, 10}. Figure 2.4 shows the new used
sets for all the nodes in the computation tree.
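
The two passes can be simulated sequentially to see how the used and transferred sets interact. The following Python sketch is an illustration under several assumptions: all names (`Node`, `build`, `first_pass`, `second_pass`, `schedule_unit_jobs`) are invented here, the tree is built by recursively halving the list of leaves (which may differ in shape from the complete binary tree of the text), the job set is nonempty, and the merges are ordinary sequential merges rather than parallel bitonic merges. Jobs are stored as (due, release, id) tuples so that tuple order gives due-time order.

```python
import math
from heapq import merge


class Node:
    def __init__(self, lo, hi, left=None, right=None):
        self.lo, self.hi = lo, hi          # interval (t_L, t_R)
        self.left, self.right = left, right
        self.used, self.trans = [], []     # sorted lists of (due, release, id)


def build(bounds, i, j):
    # Node spanning leaves i..j-1; leaf i covers (bounds[i], bounds[i+1]).
    if j - i == 1:
        return Node(bounds[i], bounds[i + 1])
    mid = (i + j) // 2
    left, right = build(bounds, i, mid), build(bounds, mid, j)
    return Node(left.lo, right.hi, left, right)


def first_pass(node, released):
    if node.left is None:                  # leaf: Jackson's rule
        avail = sorted(released[node.lo])
        k = min(len(avail), node.hi - node.lo)
        node.used, node.trans = avail[:k], avail[k:]
        return
    first_pass(node.left, released)
    first_pass(node.right, released)
    lc, rc = node.left, node.right
    w = list(merge(lc.trans, rc.used))     # T_L merged with U_R
    k = min(len(w), rc.hi - rc.lo)         # w[:k] plays the role of Q
    node.used = list(merge(lc.used, w[:k]))
    node.trans = list(merge(w[k:], rc.trans))


def second_pass(node):
    if node.left is None:
        return
    lc, rc = node.left, node.right
    v = [job for job in node.used if job[1] < node.lo]   # released before t_L
    w = list(merge(lc.used, v))
    lc.used = w[:min(len(w), lc.hi - lc.lo)]             # new used set U'
    taken = set(lc.used)
    rc.used = [job for job in node.used if job not in taken]
    second_pass(lc)
    second_pass(rc)


def schedule_unit_jobs(jobs):
    """jobs: dict id -> (release, due). Returns dict id -> start time."""
    released = {}
    for job_id, (release, due) in jobs.items():
        released.setdefault(release, []).append((due, release, job_id))
    bounds = sorted(released) + [math.inf]               # R_1 < ... < R_k, then infinity
    root = build(bounds, 0, len(bounds) - 1)
    first_pass(root, released)
    second_pass(root)
    start, stack = {}, [root]
    while stack:                                         # read the schedule off the leaves
        node = stack.pop()
        if node.left is None:
            for offset, (_, _, job_id) in enumerate(node.used):
                start[job_id] = node.lo + offset
        else:
            stack += [node.left, node.right]
    return start
```

On a small hypothetical job set such as `{"a": (0, 2), "b": (0, 6), "c": (0, 7), "d": (2, 3), "e": (3, 4), "f": (3, 5)}`, job c is transferred out of the leaf (0, 2) in the first pass and is placed in the last leaf by the second pass, which is the same schedule that slot-by-slot selection by least due time produces.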

Figure 2.4 Results of second pass.

From the definition of an updated used set, it follows
that the schedule defined by the leaf nodes (for our
example, this is: job 5 at time 2, job 8 at time 3, job 6 at
time 4, job 2 at time 5, etc.) minimizes L_max. The
correctness of the node updating procedure is easily seen. If P is
the root node, then it represents the interval (R_1, ∞). All
jobs are necessarily scheduled in this interval by Horn's
algorithm. Hence, the updated used set for this node
consists of all n jobs. Now, let P be any nonleaf node for
which we have obtained the updated used set. Assume that
this is in fact the correct updated used set, i.e., it
consists exactly of those jobs scheduled by Horn's algorithm in
that interval. We shall show that the updating procedure
gives the correct used sets for the left and right child of
P. Let t_L, t'_R, t_R, V, W, U, and U' be as defined in the
updating procedure. Let X be the used set for P. From the
way the first pass works, it follows that only jobs from W =
U ∪ V are candidates for scheduling by Horn's algorithm in
the interval (t_L, t'_R). It is a simple matter to see that
only min{|W|, t'_R - t_L} of these can be scheduled in this
interval; further, these jobs are selected in nondecreasing
order of due times. Hence, U' is correctly computed. From
this it follows that the used set for the right child must
be X - U'.


Having established the correctness of our parallel
procedure, we are ready to determine its complexity as well as
the required number of PEs. The first step consists of
sorting the jobs. This can be done in O(log^2 n) time using
⌊n/2⌋ PEs [4]. In both the first and second passes over
the computation tree we are essentially performing a fixed
number of merges of ordered sets at each node. Using
Batcher's bitonic merge scheme ([4], [18]), a p element ordered
set can be merged with a q element ordered set using
⌊(p+q)/2⌋ PEs in O(log(p+q)) time. Hence, the overall
complexity of our parallel L_max algorithm is O(log^2 n). The
number of PEs used is ⌊n/2⌋. The EPU of this algorithm is
O(n log n/((n/2) log^2 n)) = O(1/log n).


Our parallel L_max algorithm for the case m = 1 easily
generalizes to the case m > 1. The two passes over the
computation tree are changed so that all uses of t_R - t_L and
t'_R - t_L are replaced by m(t_R - t_L) and m(t'_R - t_L)
respectively. The schedule is obtained from the updated used
sets of the leaf nodes. The ith job in this used set is
assigned to machine (i mod m) + 1.


2.1.2 m = 1 and preemptions permitted

Horn's [14] algorithm for this problem is quite similar to
the sequential algorithm for the case discussed in section
2.1.1 and also has a sequential complexity that is O(n log
n). A schedule with minimum L_max is obtained by starting at
the first release time and considering an available job, i,
with least due time. Let the processing time of this job be
p. Let the time to the next release time be t and let the
current time be T. Job i is scheduled from T to T +
min{p,t}. The current time changes from T to T + min{p,t}
and the remaining processing time for job i becomes p -
min{p,t}. Next, from the available job set at the current
time T a job with minimum due time is selected for
processing, and so on.
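
The sequential rule just described can be sketched with a priority queue keyed on due time. This is an illustrative Python sketch, not the authors' code: the function name `horn_preemptive`, the dictionary input format, and the assumption of positive processing times are all choices made for the example.

```python
import heapq


def horn_preemptive(jobs):
    """jobs: dict id -> (release, processing, due), processing > 0.

    Returns the schedule as a list of (id, start, end) pieces for
    one machine with preemptions allowed.
    """
    order = sorted(jobs, key=lambda j: jobs[j][0])    # by release time
    remaining = {j: jobs[j][1] for j in jobs}
    heap, pieces, i, T = [], [], 0, 0
    while i < len(order) or heap:
        if not heap:                                  # idle until the next release
            T = max(T, jobs[order[i]][0])
        while i < len(order) and jobs[order[i]][0] <= T:
            j = order[i]
            heapq.heappush(heap, (jobs[j][2], j))     # keyed on due time
            i += 1
        due, j = heapq.heappop(heap)                  # available job with least due time
        # t: time to the next release (infinite if none is left)
        t = jobs[order[i]][0] - T if i < len(order) else float("inf")
        run = min(remaining[j], t)
        pieces.append((j, T, T + run))
        T += run
        remaining[j] -= run
        if remaining[j] > 0:                          # preempted: back into the pool
            heapq.heappush(heap, (due, j))
    return pieces
```

For the hypothetical input `{1: (0, 3, 10), 2: (1, 1, 2), 3: (2, 2, 5)}`, job 1 runs for one unit, is preempted by job 2 and then job 3, and resumes at time 4.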

The parallel algorithm of section 2.1.1 can be adapted
to this case. Jobs are sorted as before and two passes are
made over the tree. In the first pass, used and transferred
sets are computed for each node. In the second pass, the
used sets are updated. For the first pass, the used and
transferred sets for the leaf nodes are obtained by
computing the partial sum sequence for the ordered set of
available jobs for each leaf (see the algorithm of Figure 1.8).
Next, for each leaf we determine the first partial sum, j,
(if any) that exceeds the value of t_R - t_L for that node. If
there is no such partial sum, then all the available jobs
are used. If there is, then the used set consists of jobs
1, 2, ..., j-1 together with a fraction, f, of job j. This
fraction is chosen such that the sum of the processing times
of jobs 1, 2, ..., j-1 and f times that of job j equals
t_R - t_L. The transferred set consists of 1-f of job j together
with the remaining jobs.
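
A short Python sketch may help make the splitting step concrete. It is an illustration only: the name `split_by_capacity` is invented, the job list is assumed to be in nondecreasing due-time order, and each piece is recorded as an amount of processing time rather than as the fraction f (so f is the used amount divided by the job's processing time).

```python
from itertools import accumulate


def split_by_capacity(jobs, capacity):
    """jobs: list of (id, processing_time); capacity plays the role of t_R - t_L.

    Returns (used, transferred), each a list of (id, amount).
    At most one job is split between the two lists.
    """
    sums = list(accumulate(p for _, p in jobs))      # partial sum sequence
    used, transferred = [], []
    for (job_id, p), s in zip(jobs, sums):
        before = s - p                               # partial sum of the earlier jobs
        if s <= capacity:
            used.append((job_id, p))
        elif before >= capacity:
            transferred.append((job_id, p))
        else:                                        # first partial sum exceeding capacity
            used.append((job_id, capacity - before))
            transferred.append((job_id, s - capacity))
    return used, transferred
```

With jobs `[(1, 2), (2, 3), (3, 4)]` and capacity 4, job 2 is split: two units are used and one unit is transferred, which corresponds to f = 2/3.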

For nonleaf nodes, the used and transferred sets are
computed from the corresponding sets for the left and right
children. Let P be a nonleaf node. Let Q and S be its left
and right children respectively. First, the transferred set
of Q is merged (according to due times) with the used set
of S, to obtain W. The partial
sums for W are computed and W is partitioned into W1 and W2
such that the sum of the processing times for the jobs in W1
equals min{sum of processing times in W, t''_R - t''_L} where
(t''_L, t''_R) is the interval associated with node S. Observe that
this partitioning of W may require us to split one of the
jobs in W in the same way as was done for leaf nodes. The
used set for P is obtained by merging together W1 and the
used set for Q. The transferred set for P is obtained by
merging together W2 and the transferred set for S.

The updating of the second pass is also carried out in a
manner similar to that used in section 2.1.1. The updated
used set for the root node consists of all n jobs. Let P be
a node for which the updated used set has been computed.
Let (t_L, t_R) be the interval associated with P. Let Q and S,
respectively, be the left and right children of P. Let the
interval associated with Q be (t_L, t'_R). Define V to be the
set of all jobs in the used set of P that have a release
time less than t_L. Merge V and the current used set of Q
together. Let the resulting ordered set be W. Compute the
partial sums for W and partition W into W1 and W2 as was
done in the first pass. Once again, it may be necessary to
split a job into two to accomplish this. The used set for Q
is W1. The remaining jobs in the used set of P (including
possibly a remaining fraction of a job that went into W1)
constitute the used set for S.

Once the updated used sets for the leaves have been
computed, a schedule minimizing L_max is obtained by
scheduling the used sets of the leaves in the intervals associated
with them. For each such interval, the scheduling is in
nondecreasing order of due time.

The correctness of the algorithm described above
follows from the correctness of Horn's algorithm and the
discussion in section 2.1.1. The algorithm can be run in
O(log^2 n) time using at most 3n/2 PEs. Note that because
jobs may split, we may at some level have a total of n+2k
jobs (or job parts). Recall that k denotes the number of
distinct release times and that at each node at most one
additional job split can occur. Because of the effective
increase in number of jobs, more than ⌊n/2⌋ PEs are needed
here, while only ⌊n/2⌋ were needed in section 2.1.1. The
EPU is still O(1/log n).

Example  2.^:  Figure  2.5  gives  an  example  job  set.  Since 
the jobs are already in the order desired, we may begin
directly  with  the  first  pass  over  the  computation  tree. 
Figure  2.6  gives  the  result  of  the  first  pass.  Figure  2.7 
gives  the  result  of  the  second  pass.  [] 

Figure 2.7


2.1.3 Precedence Constraints


Suppose that the set of jobs to be scheduled defines a
partial order <. i < j means that the processing of job j
cannot commence until the processing of job i has been
completed. Let (r_i, d_i, p_i) be the release, due, and
processing times of job i. Modify the release and due times as
below:

    r'_i = max{ r_i, max_{j<i} { r'_j + p_j } }

    d'_i = min{ d_i, min_{i<j} { d'_j - p_j } }

Rinnooy Kan [31] has observed that a schedule minimizing L_max
when p_i = 1, the r_i s are integer, and < is a partial order can
be obtained by simply using Horn's algorithm (cf. section
2.1.1) on the jobs (r'_i, p_i = 1, d'_i), 1 ≤ i ≤ n with no precedence
constraints. Since the modified release and due times can
be computed in O(log n) time using the critical path
algorithm of [9], a schedule minimizing L_max in the presence of
precedence constraints can be obtained in O(log^2 n) time
(m = 1, p_i = 1). The number of PEs needed by the algorithm of
[9] is n^3/log n, so the EPU of the resulting algorithm is
O(n^2 log n/(n^3 log^2 n)) = O(1/(n log n)).
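
The modified release and due times can be computed sequentially by one sweep in topological order and one in reverse. The Python sketch below is an illustration under stated assumptions: it takes the modification to be recursive (a job's modified release time depends on the modified release times of its predecessors, and its modified due time is a minimum over its successors), it is given only the immediate predecessors of each job, and the name `modified_times` and the input format are invented here. It is a plain sequential computation, not the parallel critical path algorithm of [9].

```python
from graphlib import TopologicalSorter  # Python 3.9+


def modified_times(jobs, preds):
    """jobs: dict id -> (r, p, d); preds: dict id -> immediate predecessors.

    Returns (r_mod, d_mod), the release and due times adjusted so that
    the precedence constraints are implied by the time windows.
    """
    graph = {j: set(preds.get(j, ())) for j in jobs}
    order = list(TopologicalSorter(graph).static_order())   # predecessors first

    r_mod = {}
    for i in order:
        r_mod[i] = max([jobs[i][0]] + [r_mod[j] + jobs[j][1] for j in graph[i]])

    d_mod = {j: jobs[j][2] for j in jobs}
    for j in reversed(order):                                # successors first
        for i in graph[j]:                                   # i must finish before j starts
            d_mod[i] = min(d_mod[i], d_mod[j] - jobs[j][1])
    return r_mod, d_mod
```

For a hypothetical chain 1 < 2 < 3 of unit jobs, all released at time 0 with due times 5, 2, and 3, the sketch returns release times 0, 1, 2 and due times 1, 2, 3, so the window of each job already forces the chain order.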

When m = 1, a partial order < is specified, and
preemptions are allowed, a schedule minimizing L_max can be
obtained by computing modified release and due times as
above and then using the algorithm of section 2.1.2 on the
modified jobs. The resulting algorithm has complexity
O(log^2 n); uses O(n^3/log n) PEs; and has an EPU that is
O(1/(n log n)).


2.2 Minimizing Total Costs

Let (r_i, d_i, p_i, w_i), 1 ≤ i ≤ n define n jobs. Let S be any one
machine schedule for these jobs. The completion time c_i of
job i is the time at which it completes processing. Job i
is tardy iff c_i > d_i. The tardiness T_i of job i is
max{0, c_i - d_i}. When p_i = 1, Horn's [14] algorithm described in
section 2.1.2 also finds a schedule that minimizes ΣT_i.

A schedule that minimizes Σw_i c_i when p_i = 1, 1 ≤ i ≤ n and
m = 1 can be obtained by extending Smith's rule (see Rinnooy
Kan [31]). Smith's rule [35] minimizes Σw_i c_i when r_i = 0,
1 ≤ i ≤ n. It essentially schedules jobs in nondecreasing order
of p_i/w_i. The extension to the case when p_i = 1, 1 ≤ i ≤ n and
the r_i s may be different (but integer) works in the following
way. Scheduling is done time slot by time slot. From the
set of available jobs for any slot, a job with least 1/w_i
(or equivalently, maximum w_i) is selected and scheduled in
this slot. This procedure is quite similar to that used for
the L_max problem with p_i = 1 (see section 2.1.1). The only
difference is that Smith's rule replaces the use of
Jackson's rule in section 2.1.1. The used and transferred sets are
now kept in nonincreasing order of weights.
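
The slot-by-slot extension of Smith's rule can be sketched sequentially as follows. This Python fragment is only an illustration: the name `weighted_unit_schedule` and the dictionary input are assumptions, and a heap keyed on negated weight stands in for "select a job with maximum w_i".

```python
import heapq


def weighted_unit_schedule(jobs):
    """jobs: dict id -> (release, weight); unit processing times, m = 1.

    Returns dict id -> completion time.
    """
    order = sorted(jobs, key=lambda j: jobs[j][0])    # by release time
    heap, completion, i, t = [], {}, 0, 0
    while i < len(order) or heap:
        if not heap:                                  # no job available: skip ahead
            t = max(t, jobs[order[i]][0])
        while i < len(order) and jobs[order[i]][0] <= t:
            j = order[i]
            heapq.heappush(heap, (-jobs[j][1], j))    # largest weight first
            i += 1
        _, j = heapq.heappop(heap)
        t += 1                                        # job j occupies slot [t, t+1]
        completion[j] = t
    return completion
```

For the hypothetical jobs `{1: (0, 1), 2: (0, 5), 3: (1, 10)}` the heavier job 2 takes the first slot, job 3 the second, and job 1 the third.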

Since the preemptive schedule obtained by the algorithm
of section 2.1.2 also minimizes ΣT_i, this problem is easily
solved in parallel. When Σc_i is to be minimized, m = 1, and
preemptions are permitted, the algorithm of section 2.1.2
can still be used. This time, however, the used and
transferred sets are maintained in nondecreasing order of p_i
rather than d_i [31].

a release time r_i = 0. The fastest sequential algorithm for
this problem is due to Hodgson and Moore [23]. It consists
of the following three steps:

Step 1: Sort the n jobs into nondecreasing order of due
        times. Initialize the set R of tardy jobs to be
        empty.

Step 2: If there is no tardy job in the current sorted
        sequence, then append the jobs in R to this
        sequence. This yields the desired schedule. Stop.

Step 3: Find the first tardy job in the current sorted
        sequence. Let this be in position j. Find the job
        with the largest processing time from amongst the
        first j jobs in this sequence. Remove this job
        from the sequence and add it to R. Go to step 2.
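
The three steps can be sketched in Python as below. The sketch uses the common single-pass formulation with a max-heap on processing times instead of literally repeating steps 2 and 3; that restructuring, the name `hodgson_moore`, and the input format are assumptions of this illustration. Ties in the sort are broken by processing time.

```python
import heapq


def hodgson_moore(jobs):
    """jobs: dict id -> (processing, due), all released at time 0.

    Returns (schedule, tardy): the full sequence and the set R of tardy jobs.
    """
    order = sorted(jobs, key=lambda j: (jobs[j][1], jobs[j][0]))   # Step 1
    kept, tardy, t = [], [], 0
    for j in order:
        p, d = jobs[j]
        heapq.heappush(kept, (-p, j))        # max-heap on processing time
        t += p
        if t > d:                            # Step 3: j is the first tardy job
            neg_p, longest = heapq.heappop(kept)
            t += neg_p                       # drop the longest job seen so far
            tardy.append(longest)
    tardy_set = set(tardy)
    on_time = [j for j in order if j not in tardy_set]
    return on_time + tardy, tardy            # Step 2: append R at the end
```

With the hypothetical jobs `{1: (2, 3), 2: (3, 4), 3: (1, 5), 4: (4, 6)}`, jobs 2 and 4 are moved to R and jobs 1 and 3 finish on time.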


The time complexity of the Hodgson and Moore algorithm
is O(n log n). As in the case of the Hodgson and Moore
algorithm, our parallel algorithm for this problem begins by
sorting the jobs into nondecreasing order of due times.
Within due times, jobs are sorted by p_i. Let D_1, D_2, ...,
and D_k (D_1 < D_2 < ... < D_k) be the k distinct due times
associated with the n jobs. Let D_0 = 0. We next consider the unique
complete binary tree that has exactly k leaves. If the leaf
nodes of this tree are considered from left to right, then
with the ith leaf we associate the interval (D_{i-1}, D_i). The
interval associated with a nonleaf node is (t_1, t_2) iff there
exists t_3 such that (t_1, t_3) and (t_3, t_2) are the intervals,
respectively, associated with its left and right children.
If the interval (t_1, t_2) is associated with some node P, then
all jobs with a due time d_i such that t_1 < d_i ≤ t_2 are
associated with that node.

The set J(P) of jobs associated with any node P may be
partitioned into two sets S(P) and R(P). S(P) and R(P) are
defined in the following way. Consider the problem of
obtaining a schedule that minimizes the number of tardy jobs
for J(P) assuming that all jobs in J(P) have a release time
t_1 ((t_1, t_2) is the interval associated with P). S(P) is the
set of nontardy jobs in this schedule while R(P) is the set
of tardy jobs. It is well known [16] that if all jobs in
S(P) are scheduled in nondecreasing order of due times then
no job in S(P) will be tardy. From the definition of S and
R, it is clear that S(root) defines the set of nontardy
jobs in a schedule for all n jobs that minimizes the number
of tardy jobs. These jobs may be scheduled at the front of
the schedule in nondecreasing order of due times. The
remaining jobs can be scheduled, in any order, after the
jobs in S(root).

For a leaf node P, S(P) and R(P) are easily computed.
First the partial sum sequence for J(P) is obtained (recall
that the jobs associated with P are in nondecreasing order
of p_i). Let the interval associated with P be (t_1, t_2). All
jobs with a partial sum that is less than or equal to t_2 - t_1
are in S(P). The remainder are in R(P).

Let us consider an example. Figure 2.8(a) shows a set
of 10 jobs. In Figure 2.8(b), these jobs have been ordered
by due times and within due times by p_i. There are four
distinct due times, and we have D(0:4) = (0, 8, 15, 17, 25).
Figure 2.9 shows the complete binary tree with four leaves.
The interval associated with each node is also given. The S
and R sets for each of the leaf nodes are also shown.

Figure 2.8


Figure 2.9

The computation of S and R for a nonleaf node P is done
using the S and R sets of its left child Q and its right
child T. Let the intervals associated with Q and T,
respectively, be (t_L, t_M) and (t_M, t_R). It is clear that S(T) ⊆
S(P) and that R(Q) ⊆ R(P). To get the remaining jobs in
S(P), we merge together the jobs in S(Q) and R(T). Let the
resulting ordered set be W. The partial sum sequence of the
processing times of the jobs in W is next computed. Let V
be the subset of W consisting of jobs that have a partial
sum no more than t_M - t_L. Let X = W - V. Clearly, V ⊆
S(P). However, V ∪ S(T) may not equal S(P) as it is
possible for (at most) one of the jobs in X to also be in S(P).
To determine this job, we first determine for each due time
D_i, t_M < D_i ≤ t_R, a job in X that has least processing time
amongst all jobs in X with due time D_i. If there are no
jobs in X with a certain due time D_i, then no job
corresponding to this due time is selected. Let the set of
jobs determined in this way be U = {J_1, J_2, ...}. Let
ε = (t_M - t_L) - (sum of the processing times of the jobs in
V). For each due time D_i, t_M < D_i ≤ t_R, determine the sum of
the processing times of all jobs in S(T)
with due times no more than D_i. Let this sum be Y_i. Let
Δ_i = D_i - Y_i - t_M. Now, compute γ_i = min{Δ_j : j ≥ i}. It can
be seen that the job (if any) in U with due time D_i can be in
S(P) only if its processing time is less than or equal to
ε + γ_i. This information is used to remove from U those jobs
that cannot possibly be in S(P). From the remaining jobs, the
job r with minimum processing time is selected and added to
S(P). R(P) = R(Q) ∪ (X - {r}). The S and R sets for all nonleaf
nodes in our example are specified in Figure 2.9.


The set U and the sums Y_i can be computed in O(log n) time
using O(n) PEs if S and R are available in nondecreasing
order of due times (so it is necessary to keep two copies of
each S and R; one ordered by processing times and one by due
times). The γ_i s may be computed in O(log n) time using
O(n/log n) PEs using a modified version of the partial sums
algorithm. Merging S(Q) and R(T) by processing times or by
due times requires O(log n) time and n/2 PEs. So, all the
work needed to be done at any level can be accomplished in
O(log n) time with O(n) PEs. The overall complexity of our
parallel algorithm is therefore O(log^2 n) and its EPU is
O(1/log n).


Job  Sequencing  With  Deadlines 


The problem of minimizing the sum of the weights of the
tardy jobs is commonly referred to as the job sequencing
with deadlines problem [15]. It is assumed that r_i = 0 and
p_i = 1, 1 ≤ i ≤ n. When the assumption p_i = 1 is not made, the
problem is known to be NP-hard [17]. We shall now proceed to
show how the binary tree method leads to an efficient
parallel algorithm for this problem. We shall explicitly consider
only the case m = 1. When m > 1, the problem can be transformed
into an equivalent m = 1 problem. Further, all the d_i s are
assumed to be integers.


An O(n log n) sequential algorithm for this problem
appears in [15]. This algorithm builds an optimal schedule
by first determining the set of jobs that are to be
completed by their due times. This is done by considering the
jobs in nonincreasing order of weights. The job currently
being considered is added to the set of selected jobs iff it
is possible to schedule this job and all previously selected
jobs in such a way that all of them complete by their
respective due times.
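
The greedy selection can be sketched directly from this description. The Python below is a simple quadratic illustration, not the O(n log n) algorithm of [15]: the name `greedy_job_sequencing` is invented, and feasibility is tested by sorting the selected due times and checking that the job in position i has a due time of at least i.

```python
def greedy_job_sequencing(jobs):
    """jobs: dict id -> (due, weight); unit processing times, release 0, m = 1.

    Returns the set of jobs selected to complete by their due times.
    """
    def feasible(ids):
        # With unit jobs, a set is schedulable iff, in due-time order,
        # the job in position i (counting from 1) has due time >= i.
        dues = sorted(jobs[j][0] for j in ids)
        return all(d >= pos for pos, d in enumerate(dues, start=1))

    selected = []
    for j in sorted(jobs, key=lambda j: -jobs[j][1]):   # nonincreasing weight
        if feasible(selected + [j]):
            selected.append(j)
    return set(selected)
```

For the hypothetical jobs `{"a": (1, 10), "b": (2, 5), "c": (2, 4)}`, jobs a and b are selected and c is rejected because three unit jobs cannot all finish by time 2.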


In our parallel algorithm, we begin by sorting the jobs
by due times. Jobs with the same due time are sorted into
nonincreasing order of weight. Figure 2.10(a) shows an
example job set. Figure 2.10(b) shows the result of
sorting this job set. Let the distinct due times be D_1,
D_2, ..., D_k (D_1 < D_2 < ... < D_k). Let D_0 = 0. The computation tree
to use is the unique complete binary tree with k leaves.
Consider these leaves left to right. With leaf i, we
associate the interval (D_{i-1}, D_i), 1 ≤ i ≤ k. Let P be a nonleaf
node. Let the intervals associated with its left and right
children, respectively, be (t_L, t_M) and (t_M, t_R). The interval
associated with P is (t_L, t_R). The interval associated with
the root is therefore (0, D_k). Figure 2.11 shows the
computation tree for our example. The interval associated with
each node is also shown.


Figure 2.10

The set J(P) of jobs associated with node P consists
precisely of those jobs that have a due time d_i such that
t_L < d_i ≤ t_R where (t_L, t_R) is the interval associated with P.
With each node P, we may also associate two sets of jobs,
S(P) and R(P). Consider the job sequencing with deadlines
problem defined by the job set J(P). Assume that all jobs
have a release time t_L. S(P) consists exactly of those jobs
in J(P) that will be scheduled to finish by their due times
in an optimal schedule for J(P). R(P) consists of the
remaining jobs in J(P). Once S(root node) is known, the
optimal schedule for the overall job sequencing problem is
also known.

For the leaf nodes, S(P) and R(P) are easily obtained.
For each leaf node P, S(P) consists of the min{|J(P)|, t_R - t_L}
jobs of J(P) with largest weight (see Figure 2.11). If P is a
nonleaf node, S(P) and R(P) are computed from the S and R sets
of its children. Let Q and T, respectively, be the left and
right children of P. Let the intervals associated with Q and
T, respectively, be (t_L, t_M) and (t_M, t_R). Let W = S(Q) ∪
R(T) and let V be the set consisting of the min{|W|, t_M - t_L}
jobs of W with largest weights. It is not too difficult to
see that S(P) = V ∪ S(T). Hence R(P) = J(P) - S(P) = R(Q) ∪
(W - S(P)). The S and R sets for each of the nodes in our
example are also given in Figure 2.11.


Once the S and R sets have been computed, the optimal
schedule can be obtained by sorting S(root) by due times and
appending the jobs in R(root) to the end. For our example,
the optimal schedule is 10, 8, 5, 3, 7, 11, 9, 2, 1, 13, 4,
12, 6, 14. The sum of the weights of the tardy jobs is 255.
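
The computation of S and R over the tree can be simulated sequentially in a few lines of Python. This is a sketch under assumptions: the name `tree_job_sequencing` is invented, the tree is formed by recursively halving the list of leaves (which may differ in shape from the complete binary tree of the text), at least one job is present, and V is limited to the length of the left child's interval. Jobs are stored as (-weight, id) so that sorted order is nonincreasing weight.

```python
from heapq import merge


def tree_job_sequencing(jobs):
    """jobs: dict id -> (due, weight); unit jobs, release 0, integer due times.

    Returns (S, R) for the root: the on-time set and the tardy set.
    """
    by_due = {}
    for job_id, (due, weight) in jobs.items():
        by_due.setdefault(due, []).append((-weight, job_id))
    bounds = [0] + sorted(by_due)                 # D_0 = 0 < D_1 < ... < D_k

    def solve(i, j):
        # (S, R) for the node with interval (bounds[i], bounds[j]).
        if j - i == 1:                            # leaf: keep the heaviest jobs that fit
            avail = sorted(by_due[bounds[j]])
            k = min(len(avail), bounds[j] - bounds[i])
            return avail[:k], avail[k:]
        mid = (i + j) // 2
        s_q, r_q = solve(i, mid)                  # left child Q
        s_t, r_t = solve(mid, j)                  # right child T
        w = list(merge(s_q, r_t))                 # W = S(Q) U R(T), by weight
        k = min(len(w), bounds[mid] - bounds[i])  # capacity of Q's interval
        s_p = list(merge(w[:k], s_t))             # S(P) = V U S(T)
        r_p = list(merge(r_q, w[k:]))             # R(P) = R(Q) U (W - V)
        return s_p, r_p

    s, r = solve(0, len(bounds) - 1)
    return {job_id for _, job_id in s}, {job_id for _, job_id in r}
```

On the hypothetical job set `{"a": (1, 1), "b": (2, 5), "c": (2, 4)}`, job c is first rejected at its own leaf, then displaces the lighter job a at the root, giving S = {b, c} and R = {a}.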

Since the S and R sets are maintained in nonincreasing
order of weights, the merging required at each node to
compute S and R can be carried out using a parallel bitonic
merge. Hence, all the computation needed at each level of
the computation tree can be performed in O(log n) time using
n/2 PEs. The overall complexity for our job sequencing with
deadlines algorithm is O(log^2 n) and the EPU is O(1/log n).
(In [10] Dekel and Sahni show how to solve the job
sequencing problem in O(log n) time. This algorithm does not use
the binary tree method and has an EPU which is considerably
inferior to that of the algorithm developed here.)


Finally, we note that the parallel algorithm developed
to minimize the number of tardy jobs when m = 1 and r_i = 0 can
be adapted to obtain a one machine schedule that minimizes
the sum of the weights of the tardy jobs provided that all
jobs have agreeable weights. (All jobs have agreeable
weights iff p_i < p_j implies w_i ≥ w_j for all i and j.) The
sequential algorithm for this problem is an extension of the
Hodgson-Moore algorithm to minimize the number of tardy
jobs. This extension is due to Lawler [21]. Also, Sidney's
[34] extension, which takes into account jobs that must
necessarily be completed by their due times, can be
handled by a modified version of our algorithm.


3.  Conclusions 


We have demonstrated that the binary computation tree is a
very important tool in the design of efficient parallel
algorithms. The binary tree method is closely related to the
divide-and-conquer approach used to obtain many efficient
sequential algorithms [15]. While divide-and-conquer
algorithms do use an underlying computation structure that is a
tree, the use of this tree is implicit. Further, only one
pass over this tree can be made as partial results computed
in the various nodes are not saved for use in further
passes. In this respect, the binary tree method is more
general than divide-and-conquer. The single pass algorithms
discussed in this paper can, however, be just as well viewed
as divide-and-conquer algorithms.

While all the parallel algorithms discussed in this
paper have assumed that as many PEs as needed are available,
they can be run quite easily using fewer PEs. The
complexity of course will increase by a factor of q/k where k is
the number of PEs available and q is the number assumed in
the paper.